AMENDMENTS TO THE CLAIMS
The Assignee submits below a complete listing of the current claims, including marked-up 
claims with insertions indicated by underlining and deletions indicated by strikeouts and/or double 
bracketing. This listing of claims replaces all prior versions and listings of claims in the application: 

1. (Currently amended) A program storage device readable by a machine, tangibly
embodying a program of instructions executable by the machine to perform a method for speech
synthesis that allows user specified pronunciations, the method comprising:

[[aligning a text string comprising a plurality of words and phonemes and a user specified
spoken audio signal corresponding to a desired pronunciation of the text string;]]

extracting prosodic parameters from [[said spoken]] an audio signal corresponding to a
pronunciation of a text string by a user;

extracting duration parameters by aligning the audio signal with the text string;

automatically generating [[a marked up text corresponding to the spoken audio signal]] at least
one text-to-speech (TTS) input using the prosodic parameters and the duration parameters, wherein
the at least one TTS input is formatted for use in synthesizing speech from the text string; and

generating a synthetic speech waveform using the [[marked-up text]] at least one TTS input.

2-3. (Cancelled) 

4. (Currently amended) The program storage device of claim 1, wherein the instructions for 
aligning comprise instructions for segmenting [[said spoken]] the audio signal into time-segmented
regions, wherein each time-segmented region is mapped to a corresponding phoneme. 

5. (Previously Presented) The program storage device of claim 1, wherein the alignment is 
performed using a Viterbi alignment process. 



6. (Canceled) 
7. (Currently amended) The program storage device of claim 1, wherein the instructions for
automatically generating [[a marked up text]] at least one TTS input comprise instructions for directly
specifying at least one portion of the duration [[contours]] parameters and/or the prosodic parameters as
attribute values for mark-up elements.

8. (Currently amended) The program storage device of claim 1, wherein the instructions for
automatically generating [[a marked up text]] at least one TTS input comprise instructions for
assigning abstract labels to at least one portion of the duration [[contours]] parameters and/or the
prosodic parameters to generate a high-level markup.

9. (Currently amended) The program storage device of claim 1, wherein the [[marked up text]] at
least one TTS input is generated using SSML (speech synthesis markup language).

10. (Currently amended) The program storage device of claim 1, further comprising instructions 
for processing phonetic content of the [[spoken]] audio signal to generate the synthetic speech
waveform having a desired pronunciation. 

11. (Currently amended) A method for speech synthesis that allows user specified
pronunciations, the method comprising: 

[[aligning a text string comprising a plurality of words and phonemes and a user specified
spoken audio signal corresponding to a desired pronunciation of the text string;]]

extracting prosodic parameters from [[said spoken]] an audio signal corresponding to a
pronunciation of a text string by a user;

extracting duration parameters by aligning the audio signal with the text string;

automatically generating [[a marked up text corresponding to the spoken audio signal]] at least
one text-to-speech (TTS) input using the prosodic parameters and the duration parameters, wherein
the at least one TTS input is formatted for use in synthesizing speech from the text string; and

generating a synthetic speech waveform using the [[marked up text]] at least one TTS input.

12-13. (Canceled)

14. (Currently amended) The method of claim 11, wherein aligning comprises extracting
acoustic feature data from the [[spoken]] audio signal and time-aligning the [[spoken]] audio signal to the
text string using the acoustic feature data. 

15. (Previously Presented) The method of claim 11, wherein aligning is performed using a
Viterbi alignment process.

16. (Canceled) 

17. (Currently amended) The method of claim 11, wherein automatically generating [[a marked
up text]] at least one TTS input comprises directly specifying at least one portion of the duration
[[contours]] parameters and/or the prosodic parameters as attribute values for mark-up elements.

18. (Currently amended) The method of claim 11, wherein automatically generating [[a marked
up text]] at least one TTS input comprises assigning abstract labels to at least one portion of the
duration [[contours]] parameters and/or the prosodic parameters to generate a high-level markup.

19. (Currently amended) The method of claim 11, wherein the [[marked up text]] at least one TTS
input is generated using SSML (speech synthesis markup language).

20. (Currently amended) The method of claim 11, further comprising processing phonetic
content of the [[spoken]] audio signal to generate the synthetic speech waveform having a desired
pronunciation. 

21. (Currently amended) A text-to-speech (TTS) system that allows user specified
pronunciations, comprising: 

a prosody analyzer for determining prosodic parameters of [[a spoken]] an audio signal
corresponding to a [[desired]] pronunciation by a user of an input text string and automatically
generating [[a marked up text]] at least one TTS system input corresponding to the [[spoken]] audio signal
using the prosodic parameters, wherein the prosody analyzer comprises:

a prosodic parameter extraction module for extracting the prosodic
parameters from the audio signal,

an alignment module for extracting duration parameters by aligning the input
text string with the [[spoken]] audio signal,

[[a prosodic parameter extraction module for determining prosodic parameter]]

a conversion module for generating
the at least one TTS system input using the prosodic parameters [[information to
generate marked up text]] and the duration parameters; and

a TTS system for generating a synthetic waveform using the [[marked up text]] at least one TTS
system input.

22. (Currently amended) The system of claim 21, further comprising a user interface that
enables a user to input the [[spoken]] audio signal and the input [[a]] text string corresponding to the
[[spoken]] audio signal.

23. (Currently amended) The system of claim 21, wherein the prosody analyzer processes
phonetic content of the [[spoken]] audio signal to generate the synthetic waveform having a desired
pronunciation.

24-28. (Canceled)

29. (Currently amended) The program storage device of claim 33, wherein extracting acoustic
feature data from [[said spoken]] the audio signal comprises digitizing the [[spoken]] audio signal into a
set of frames and transforming the digitized [[input waveforms]] audio signal into a set of feature
vectors on a frame-by-frame basis.

30. (Currently amended) The program storage device of claim 29, wherein transforming the
digitized [[input includes]] audio signal comprises producing a 24-dimensional cepstra feature vector
for every 10ms of the [[spoken]] audio signal, concatenating frames to the left and to the right of a
current frame to augment a current cepstral vector, and reducing each augmented cepstral vector to a 
60-dimensional feature vector using linear discriminant analysis. 

31. (Currently amended) The method of claim 34, wherein extracting acoustic feature data from
[[said spoken]] the audio signal comprises digitizing the [[spoken]] audio signal into a set of frames and
transforming the digitized [[input waveforms]] audio signal into a set of feature vectors on a
frame-by-frame basis.

32. (Currently amended) The method of claim 31, wherein transforming the digitized
[[input includes]] audio signal comprises producing a 24-dimensional cepstra feature vector for every 10ms
of the [[spoken]] audio signal, concatenating frames to the left and to the right of a current frame to
augment a current cepstral vector, and reducing each augmented cepstral vector to a 60-dimensional 
feature vector using linear discriminant analysis. 

33. (Currently amended) The program storage device of claim 1 wherein the method further
comprises extracting acoustic feature data from [[said spoken]] the audio signal and wherein the
aligning further comprises outputting a set of duration contours.

34. (Currently amended) The method of claim 11 further comprising extracting acoustic feature
data from [[said spoken]] the audio signal and wherein the aligning further comprises outputting a set
of duration contours.

35. (Currently amended) The system of claim 21 wherein the prosody analyzer further
comprises an acoustic feature extraction module that extracts acoustic feature data from [[said spoken]]
the audio signal and wherein the alignment module uses said acoustic feature data to perform the
aligning.



